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Abstract 

Background: Variations in DNA copy number carry information on the modalities of genome evolution and 
mis-regulation of DNA replication in cancer cells. Their study can help localize tumor suppressor genes, distinguish 
different populations of cancerous cells, and identify genomic variations responsible for disease phenotypes. A number 
of different high throughput technologies can be used to identify copy number variable sites, and the literature 
documents multiple effective algorithms. We focus here on the specific problem of detecting regions where variation 
in copy number is relatively common in the sample at hand. This problem encompasses the cases of copy number 
polymorphisms, related samples, technical replicates, and cancerous sub-populations from the same individual. 

Results: We present a segmentation method named generalized fused lasso (GFL) to reconstruct copy number 
variant regions. GFL is based on penalized estimation and is capable of processing multiple signals jointly. Our 
approach is computationally very attractive and leads to sensitivity and specificity levels comparable to those of 
state-of-the-art specialized methodologies. We illustrate its applicability with simulated and real data sets. 

Conclusions: The flexibility of our framework makes it applicable to data obtained with a wide range of technology. 
Its versatility and speed make GFL particularly useful in the initial screening stages of large data sets. 

Keywords: Copy number variant, Copy number polymorphism, Fused lasso, Group fused lasso, MM algorithm 



Background 

Genomic duplications and deletions are common in can- 
cer cells and known to play a role in tumor progression 
[1]. As our ability to survey the fine scale of the human 
genome has increased, it has become apparent that nor- 
mal cells can also harbor a number of variations in copy 
number (CN) [2,3]- The last few years have witnessed a 
steady increase in our knowledge of the size and frequency 
of these variants [4-7] and their implications in complex 
diseases [8,9]. At the same time, statistical methods and 
algorithms have been developed to better harness the 
information available. At the cost of oversimplification, 
two different approaches have become particularly pop- 
ular. One is based on the hidden Markov model (HMM) 
machinery and explicitly aims to reconstruct the unob- 
servable discrete DNA copy number; the other, which 
we will generically call "segmentation", aims at identifying 
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portions of the genome that have constant copy number, 
without specifically reconstructing it. 

The HMM approach takes advantage of the implicitly 
discrete nature of the copy number process (both when 
a finite number of states is assumed and when, as in 
some implementations, less parametric approaches are 
adopted); furthermore, by careful modeling of the emis- 
sion probabilities, one can fully utilize the information 
derived from the experimental results. In the case of geno- 
typing arrays, for example, quantification of total DNA 
amount, relative allelic abundance, and prior information 
such as minor allele frequencies can be considered. 

No apriori knowledge of the number of copy num- 
ber states is required in the segmentation approach— an 
advantage in the study of cancer where polyploidy and 
contamination with normal tissues result in a wide range 
of fractional copy numbers. Possibly for the reasons out- 
lined, HMMs are the methods of choice in the analysis 
of normal samples [10-14], while segmentation methods 
are the standard in cancer studies [15,16]. A limitation of 
segmentation methods is that they rely on data in which 
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the variation in copy number is reflected in the differences 
in means of the segments. This fact makes segmenta- 
tion methods applicable directly to a substantial portion 
of the data derived from recent technologies, but not to 
relative allelic abundance. However, see the modification 
suggested in [17] and the following description for an 
exception. 

While a number of successful approaches have been 
derived along the lines described above, there is still a 
paucity of methodology for the joint analysis of multiple 
sequences. It is clear that if multiple subjects share the 
same variation in copy number, there exists the poten- 
tial to increase power by joint analysis. Wang et al. [18] 
present a methodology that extends [1] to reconstruct 
the location of tumor suppressor genes from the iden- 
tification of regions lost in a larger number of samples. 
The initial steps of the Birdsuite algorithm rely on the 
identification of suspect signals in the context of multiple 
sample. PennCNV [13] includes an option of joint anal- 
ysis of trios. Methodology to process multiple samples 
within the context of change point analysis has been devel- 
oped in [16,19-21]. Efron and Zhang [22] consider FDR 
analysis of independent samples to identify copy number 
polymorphisms (CNPs), and Nowak et al. [23] use a latent 
feature model to capture, in joint analysis of array-CGH 
data from multiple tumor samples, shared copy number 
profiles, on each of which a fused-lasso penalty is enforced 
for sparsity. 

In the present work we consider a setting similar to 
[16] in that we want joint analysis to inform the seg- 
mentation of multiple samples. Our main focus is the 
analysis of genotyping array data, but the methodology 
we develop is applicable to a variety of platforms. By 
adopting a flexible framework we are able, for example, to 
define a segmentation algorithm that uses all information 
from Illumina genotyping data. As in [19], we are inter- 
ested in the situation where some but not all the samples 
under consideration carry a copy number variant (CNV). 
We prefer to enforce a certain sparsity in the vector that 
identifies which samples carry a given variant. We tackle 
this problem using a penalized estimation approach, orig- 
inally proposed in this context by [24], for which we have 
developed an algorithmic implementation [25]. Appre- 
ciable results are achieved in terms of speed, accuracy, 
and flexibility. 

In concluding this introduction, we would like to make 
an important qualification. The focus of our contribu- 
tion is on segmentation methods, knowing that this is 
only one of the steps necessary for an effective recovery 
of CNVs. In particular, normalization and transforma- 
tion of the signal from experimental sources are crucial 
and can have a very substantial impact on final results, 
as documented in [26-31], for example. Indeed, prepro- 
cessing to eliminate systematic variation in intensities 



is particularly important for joint analysis of multiple 
sequences, when repeated deviances are more likely to 
be interpreted as true signal. Furthermore, calling proce- 
dures that classify results of segmentation while possibly 
controlling global error measures [22] are also needed. 
Indeed, in the data analysis included in this paper, we 
need to resort to both of these additional steps, and we 
will describe then briefly the fairly standard choices we 
make. 

Before describing in detail the proposed methods for 
joint segmentation of multiple sequences, we start by 
illustrating various contexts where joint analysis appears 
to be useful. 

Genotyping arrays and CNV detection 

Genotyping arrays have been used on hundreds of thou- 
sands of subjects. The data collected through them pro- 
vides an extraordinary resource for CNV detection and 
the study of their frequencies in multiple populations. 
Typically, the raw intensity data (representing hybridiza- 
tion strength) is processed to obtain two signals: quantifi- 
cation of total DNA amount (from now on log R Ratio, 
LRR, following Illumina terminology) and relative abun- 
dance of the two queried alleles (from now on B allele 
frequency, BAF). Both these signals contain information 
on CNVs, and one of the strengths of HMMs has been that 
they can easily process them jointly. Segmentation mod- 
els like CBS have traditionally relied only on LRR. While 
this is a reasonable choice, it can lead to substantial loss of 
information, particularly in tumor cells, where polyploidy 
and contamination make information in LRR hard to deci- 
pher. To exploit BAF in the context of a segmentation 
method, a signal transformation has been suggested [17]: 
mirrored BAF (mBAF) relies on exchangeability of the 
two alleles and the low information content of homozy- 
gous SNPs. The resulting mBAF is defined on a coarser 
grid than the original BAF, but is characterized by chang- 
ing means in the presence of a CNV. While [17] shows 
that the analysis of BAF alone can be advantageous and 
more powerful than segmentation of LRR in some con- 
texts, clearly a joint analysis of LRR and mBAF should 
be preferable to an arbitrary selection of one or the other 
signal. 

Multiple platforms 

LRR and BAF are just one example of the multiple 
signals available in some samples. Often, as research pro- 
gresses, the samples are assessed with a variety of tech- 
nologies. For example, a number of subjects who have 
been genotyped at high resolution are now being rese- 
quenced. Whenever the technology adopted generates 
a signal that contains some information on copy num- 
ber, there is an incentive to analyze the available signals 
jointly. 
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Tumor samples from the same patient obtained at different 
sites or different progression stages 

In an effort to identify mutations that are driving a spe- 
cific tumor, as well as study its response to treatment, 
researchers might want to study CNVs in cells obtained at 
different tumor sites or at different time points [32]. Copy 
number is highly dynamic in cancer cells, so that it is to be 
expected that some differences will be detected over time 
or across sites. By contrast, the presence of the same CNVs 
across these samples can be taken as an indication that 
the tumors share the same origin. Therefore, a compara- 
tive analysis of CNV can be used to distinguish resurgence 
of the same cancer from insurgence of a new one, or to 
identify specific cancer cell populations. Given that the 
tissue extracted always consists of a mixture of normal 
and cancer cells, which are in turn a mixture of different 
populations, joint analysis of the signals from the varied 
materials is much more likely to lead to the identification 
of common CNVs when these exist. 

Related subjects 

Family data is crucial in genetic investigations, and hence 
it is common to analyze related subjects. When study- 
ing individuals from the same pedigree, it is reasonable to 
assume that some CNVs might be segregating in multi- 
ple people and that joint analysis would reduce Mendelian 
errors and increase the power of detection. 

The rest of the paper is organized as follows: In the 
Methods section, we first present the penalized estimation 
framework, and then describe how the model can be used 
for data analysis by: (a) outlining an efficient estimation 
algorithm, (b) generalizing it to the case of uncoordinated 
data, and (c) describing the choice of the penalization 
parameters. In the results section, we discuss our find- 
ings on two simulated data sets (descriptive of normal and 
tumor samples) and two real data sets. In one case mul- 
tiple platforms are used to analyze the same sample, and 
in the other case samples from related individuals benefit 
from joint analysis. 

Methods 

A model for joint analysis of multiple signals 

Assume we have observed M signals, each measured at 
N locations, corresponding to ordered physical positions 
along the genome, with yg being the observed value of 
sequence i at location The copy number process can be 
modeled as 



Jij = Pij + 



(1) 



where 6// represent noise, and the mean values 
are piece-wise constant. Thus, there exists a linearly 
ordered partition [R^fR®, . . . of the location index 
{1,2, . . . ,N] such that p is = • • • = p it = fif for e 



R® and 1 < k < K{. In other words, most of the incre- 
ments — are assumed to be zero. When two 
sequences k and / share a CNV with the same bound- 
aries at location both \Pq — fik,j-i\ an d \Pij — Pl,j-i\ 
will be different from zero at the change point Mod- 
ulo an appropriate signal normalization, fty = 0 can 
be interpreted as corresponding to the appropriate nor- 
mal copy number equal to 2. We propose to reconstruct 
the mean values P by minimizing the following function, 
called hereafter the generalized fused lasso (GFL): 
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which includes a goodness-of-fit term and three penal- 
ties, whose roles we will explain one at a time. The 
i\ penalty YlfLi Yl^Li I Pij I enforces sparsity within ft, in 
favor of values pij = 0, corresponding to the normal 
copy number. The total variation penalty Yj!=2 \Pij~Phj-l I 
minimizes the number of jumps in the piece-wise con- 
stant means of each sequence and was introduced by [24] 
in the context of CNV reconstruction from array-CGH 
data. Finally, the Euclidean penalty on the column vec- 
tor of jumps ^J2fLi(Pij ~ Pi,j-i) 2 is a form of the group 
penalty introduced by [33] and favors common jumps 
across sequences. As clearly explained in [34], "the local 
penalty around 0 for each member of a group relaxes as 
soon as the — for one member i of the group 

moves off 0." Bleakley and Vert (2011) [35] also suggested 
the use of this group-fused-lasso penalty to reconstruct 
CNV We consider here the use of both the total varia- 
tion and the Euclidean penalty on the jumps to achieve 
the equivalent effect of the sparse group lasso, which, as 
pointed out in [36], favors CNV detection in multiple sam- 
ples, allowing for sparsity in the vector indicating which 
subjects are carriers of the variant. This property is impor- 
tant in situations of multiple tumor samples and related 
subjects, where one does not want to assume that all the 
M sequences carry the same CNV. 

The incorporation of the latter two penalties can also 
be naturally interpreted in view of image denoising. 
To restore an image disturbed by random noise while 
preserving sharp edges of items in the image, a 2- 
D total variation penalty ^YaLi Ylf=2 \Pij ~ Phj-i\ + 

PYlf=iY^2\Pij - Pi-hj\ is proposed in a regularized 
least-square model [37], where pij is the true underlying 
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intensity of pixel (/,/*). In CNV detection problems, sig- 
nals from multiple sequences can be aligned in the shape 
of an image, except that pixels in each sequence are lin- 
early ordered while sequences as a group have no certain 
order a priori. Thus, one of the two total variation penal- 
ties is replaced by the group penalty on the column vector 
of jumps. 

Using matrix notation, and allowing the tuning parame- 
ter X i, A 2 and A3 to be sequence specific, we can reformu- 
late the objective function as follows. Let Y = (jij)MxN 
and ft = (Pij)MxN- Let fi t be the ith row of and fi^ the 
yth column of Also, let X% = (X^Mxi- Then we have 

M 



f(P) = h\Y-p\\l + J2 x ^i^i 



i=l 
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+ I>2,il 
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where 1 1 • | \p is the Frobenius matrix norm, 1 1 • | ki and \ \-\\t 2 
are the i\ and l<i vector norms, /$ is:t indicates the sub- 
vector with elements /3; )5 , . . . , pi )t of the row vector fi it and 
V denotes entry- wise multiplication between two vec- 
tors. It would be easy to modify the tuning parameters so 
as to make them location specific by reducing the penalty 
for a jump in genomic regions known to harbor CNVs. 

An MM algorithm 

While the solution to the optimization problem (3) might 
have interesting properties, this approach is useful only if 
an effective algorithm is available. The last few years have 
witnessed substantial advances in computational meth- 
ods for €i-regularization problems, including the use of 
coordinate descent [38,39] and path following methods 
[35,40-42]. The computational complexity of these meth- 
ods in the best situation is 0(MNK), where K indicated 
the number of knots along the solution path. Here knots 
are conjunction points between a series of piecewise func- 
tions of tuning parameters. It is important to note that 
these algorithms - some of which are designed for more 
general applications - may not be the most efficient for 
large scale CNV analysis for at least two reasons. On the 
one hand, reasonable choices of X might be available, mak- 
ing it unnecessary to solve for the entire path; on the other 
hand, the number of knots K can be expected to be as 
large as 0(N), making the computational costs of path 
algorithms prohibitive. 

With specific regard to the fused-lasso application to 
CNV detection, we were successful in developing an algo- 
rithm with per iteration cost 0(N) and empirically fast 
convergence rate for the analysis of one sequence [25] . We 



apply the same principles here and start by modifying the 
norms in the penalty to achieve better computational sta- 
bility. For the l\ norm we substitute ||#||2,e = \/x 2 + e 

for sufficiently small e, and for the l 2 norm we substitute 

1 

I l x l 1 2,e = (S?=i x f + 6 ) 2 • This produces the differentiate 
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Adopting an MM framework [43], we want to find a sur- 
rogate function g e (/$ \ fi( m ^) for each iteration m such that 
ge(f} (m) , fi(m)) =/e( ^(m) ) and&(/J 1 p(m) } > ff(fi) foj . ^ 

At each iteration, the MM principle chooses ^ m+v> = 
argmin | fS^). A majorizing function with the 
above properties is readily obtained using the concavity of 
the square-root function ||#||2,e < 2||z|| 2 ^ ~ anc * 
its vector equivalent | |x| \ 2 , € < ^^(| |x| |^ - | |z| |^). The 
resulting surrogate function 
1 m N 
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separates as a sum of similar functions in the the row 
vectors Pf, namely, 
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Here each A^ is a tridiagonal symmetric matrix, and 
cf 1 ^ is a constant, irrelevant for optimization purposes. 
In view of the strict convexity of the surrogate function, 
each is also positive definite. The nonzero entries of 
A^ m) and b^ m) (i = 1, . . . ,M) are listed in Additional file 
1: Supplementary Text . Each of the surrogate functions 
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in (5) can be minimized by solving the linear system f$i = 
[ p\ m) ] T [ A\ m) ] - 1 by the Tri-diagonal Matrix (TDM) algo- 
rithm [44]. This results in a per-iteration computational 
cost of 0(MN). This algorithm is empirically observed 
to achieve an exponential convergence rate [25], although 
we do not yet have an analytic proof. In practice, this 
method scales well with joint analysis of tens to hundreds 
of samples with measurements at millions of locations, 
with limitations dictated by memory requirements. For 
analysis of real data, we suggest one or a group of sam- 
ples to be analyzed chromosome by chromosome, since a 
CNV region can never extend beyond one chromosome to 
another. Actual computation times are shown along with 
different examples in the results section. Readers might be 
interested in comparing the outlined approach with other 
segmentation methods that are not based on the use of i\ 
penalties, as [45]. 

Stacking observations at different genomic locations 

While copy number is continuously defined across the 
genome, experimental procedures record data at dis- 
crete positions, for which we have used the indexes ; = 
1, . . . , N. In reality, repeated evaluations of the same sam- 
ple (or related samples) will typically result in measure- 
ments at only partially overlapping genomic locations, 
either because different platforms use different sets of 
probes or missing data occur at different positions across 
the sequences. For example, for mBAF and LRR measure- 
ments from the same experiment on the same subject, the 
mBAF signal is defined on a subset of the locations where 
the LRR signal is defined. 

Let S indicate the union of all genomic positions where 
some measurement is available among the M signals 
under study. And let Si be the subset of locations with 
measurements in sequence /. We reconstruct for all 
j G <S. When j £ Si, fiij will be determined simply on the 
basis of the neighboring data points, relying on the regu- 
larizations introduced in (3). The goodness-of-fit portion 
of the objective function is therefore redefined as 

(6) 

The MM strategy previously described applies with 
slight modifications of the matrix (see Additional file 
1: Supplementary Text). 

The attentive reader will have noted that the ytj values 
with ; ^ Si can be considered missing data, and evalua- 
tion of the missingness pattern is appropriate. In general, 
the jij cannot be considered missing at random. The most 
important example is the case of mBAF, where homozy- 
gous markers result in missing values. Homozygosity is 



clearly more common when copy number is equal to 1 
than when copy number is equal to 2. Therefore, there 
is potentially more information on to be extracted 
from the signals than what we capture with the proposed 
method. Although most of the information on deletions is 
obtained through LRR, BAF does convey additional infor- 
mation on duplications, where the changes in LRR are 
limited by saturation effects. On the other hand, it does 
appear that our method does not increase the rate of false 
positives. Hence, it can be considered as an operational 
improvement over segmentation based on LRR only, even 
if in theory, it does not completely use the information on 
BAF. 

Choice of tuning constants and segmentation 

One of the limitations of penalization procedures is that 
values for the tuning parameters need to be set, and 
clear guidelines are not always available. Path methods 
that obtain a solution of the optimization problem (3) for 
every value of a tuning parameter can be attractive, but 
recent algorithmic advances [35,41,42] remain impracti- 
cal for problems of the size of ours. A number of recent 
publications obtain optimal values of penalty parameters 
under a series of conditions [46-49]. We rely on these find- 
ings to propose a strategy for obtaining a solution of (3) 
for reasonably liberal values of the tuning parameters, fol- 
lowed by a sequence-by-sequence hard thresholding of 
the detected jumps with a data-adaptive threshold. 

We have found the following guidelines to be useful in 
choosing penalty parameter values: 

Hi = 

Hi = p(p)c20iy/logN, (7) 
Hi =i 1 - P(P)] c<$OiJpM^\o%N, 

for i = 1,...,M, where 07 is a robust estimate of stan- 
dard deviation of y;, p is roughly the proportion of the 
M sequences we anticipate to carry CNVs, and c\, C2 and 
C3 are positive multipliers adjusted to account for differ- 
ent signal-to-noise ratios and CNV sizes. We discuss the 
function p below. 

While a more rigorous justification is provided in the 
Additional file 1: Supplementary Text, we start by under- 
scoring some of the characteristics of this proposal. 

• The sequence-specific penalizing parameters are 
proportional to an estimate of the standard deviation 
of the sequence signal. In other words, after initial 
normalization, the same penalties would be used 
across all signals. 

• The tuning parameter for the total variation (fused 
lasso) and the Euclidean (group fused lasso) penalties 
on the jumps depend on ^logAf, where N is the 
possible number of jumps. This has a "multiple 
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comparison controlling" effect and resembles rates 
that have been proven optimal under various sparse 
scenarios [46-49]. This term does not appear in the 
expression of X i, as the lasso penalty can be 
understood as providing a soft thresholding of the 
solution of (3) when X\ = 0. Given the penalization 
due to A 2 and A3, the solution of (3) when k\ = 0 will 
have much smaller dimension than N. 

• The group penalty depends on *JM, where M is the 
number of grouped sequences, as in the original 
proposal [33]. 

• The relative weight of the fused-lasso and 
group-fused-lasso penalties is regulated by p, which 
depends on p, the proportion of the M sequences 
expected to carry the same CNV. For example, if 

M = 2 and the two sequences record LRR and mBAF 
from the same individual, one expects p = 1 with 
p = 0, enforcing jumps at identical places in the two 
signals. Or, when studying the sequences relative to 
one parent and her/his offspring, it is reasonable to 
set p = 1 /2, reflecting the fact that the two share half 
of their genome. For completely unrelated sequences, 
p = 0 and p = 1. In this setting, p is denned by 
genome-wide characteristic of the samples. However, 
when analyzing specific genomic segments, 
corresponding to copy number polymorphisms, it is 
possible to use p to reflect the population frequencies 
of each of the copy number variants. We do not 
consider the problem of estimating a variable p in the 
present work. 

The standard deviation 67 can be estimated robustly as 
follows. Let Ay = — yi,j, for; = 1, . . . , N — 1, be the 
first-order differences of adjacent yy for sequence /. Then 
most Var(A^) = 2a? except those bridging real change 
points, so we can take 

o t = SD(A/)/V2, 

where SD(Ai) = Standard Deviation(A;) or SD(Ai) = 
Median Absolute Deiviation(A;) for A/ = {Aj,i, . . . , 
A;,iv-i}. 

As mentioned before, the exact values of the penalty 
parameters should be adjusted depending on the expecta- 
tions of signal strengths. Following the approach in [50], 
one can approximate the bias induced by each of the 
penalties and hence work backwards in terms of accept- 
able levels. As detailed in Additional file 1: Supplementary 
Text, 

Bias(Ai) oc A.i 

Bias(A<2) oc A^/Length of segment 
Bias(A<3) oc A3/ (Length of segment 

sequences sharing segment J 



Following again the approach in [50], one can show that 
under some relatively strong assumptions, the choices in 
(7) lead to a consistent behavior as N —> 00 and M 
stays bounded (see Additional file 1: Supplementary Text). 
Despite the fact that N is indeed large in our studies, it 
is not clear that we can assume it to be in the asymptotic 
regime. As finer scale measurements become available, 
scientists desire to investigate CNVs of decreasing length. 
The CNVs we are interested in discovering are often cov- 
ered by a small number of probes. Furthermore we have 
often little information on the sizes and frequencies of 
CNVs. In this context, we find it advisable to rely on a 
two-stage strategy: 

1. Sequences are jointly segmented by minimizing (3) 
for a relatively lax choice of the penalty parameters. 

2. Jumps are further thresholded on the basis of a 
data-driven cut-off. 

Step 2 allows us to be adaptive to the signal strength and 
can be carried on with multiple methods. For example, 
one can adopt the modified Bayesian Information Crite- 
ria (mBIC) [51]. For sequence i, the jumps are sorted as 
{di(i), . . . , di(N-i)} i n the descending order of their abso- 
lute values. Then we choose the first k change points 
where k is given by 

k = arg maxmBIC(/c). 

k 

In data analysis, we often apply an even simpler proce- 
dure where the threshold for jumps is defined as a fraction 
of the maximal jump size observed for every sequence. 
Specifically, for sequence /, let A = max2</<jv{|^>'|}> 
where dy = — is the largest observed jump for 

sequence /. Then we define 

Yi = max{^07,min{A',&<7j}}, for a < b, 

as a "ruler" reflecting the scale of a possible real jump size, 
taking cy; as the cut-off in removal of most small jumps. In 
all analyses for this paper, we fix a = 1, b = 5 and c = 0.2. 
In our experience, this heuristic procedure works well for 
both tumor and normal tissue CNV data. 

Calling procedure 

Even if this is not the focus of our proposal, in order 
to compare the performance of our segmentation algo- 
rithm with HMM approaches, it becomes necessary to 
distinguish gains from losses of copy number. While the 
same segmentation algorithm can be applied to a wide 
range of data sets, calling procedures depend more closely 
on the specific technology used to carry out the experi- 
ments. Since our data analysis relies on Illumina genotyp- 
ing arrays, we limit ourselves to this platform and briefly 
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describe the calling procedure adopted in the results 
section. 

Analyzing one subject at the time, each segment with 
constant mean is assigned to one of five possible copy 
number states (c = 0, 1, 2, 3, 4). Let R collect the indexes 
of all SNPs comprising one segment and let (x#,y#) = 
{(Xj,yf),j e R} be the vectors of values for BAF and LRR in 
the segment. On the basis of typical patterns for BAF and 
LRR in the different copy number states (see [10,13,18]), 
we can write log-likelihood ratio 

LR(c) = i og W^ + log WZ£i£l, c = 0) i,3,4, 

(8) 

explicitly denned in Additional file 1: Supplementary Text. 
Segment R is assigned a CNV state c that maximize LR(c), 
only if LR(c) > r\, where r\ is a pre-specified cut-off. 

As noted in [16], the LRR data for a segment with c = 2, 
ideally normalized to have mean 0, often has a small non- 
zero mean due to experimental artifacts. If the number of 
SNPs in R is sufficiently large, the above log-likelihood- 
ratio criterion would result in the erroneous identification 
of a copy number different from 2. To avoid this, we also 
require that the size of the absolute difference of the mean 
of LRR from zero be larger than a threshold \y^\ > r 2 o '. 

Results and discussion 

We report the results of the analysis of two simulated and 
two real data sets, which overall exemplify the variety of 
situations where joint segmentation of multiple sequences 
is attractive, as described in the motivation section. In 
all cases, we compare the performance of the proposed 
procedure with a set of relevant, often specialized, algo- 
rithms. The penalized estimation method we suggest in 
this paper shows competitive performance in all cases and 
often a substantial computational advantage. Its versatility 
and speed make it a very convenient tool for initial explo- 
ration. To calibrate the run times reported in the sequel, 
we state for the record that all of our analyses were run on 
a Mac OS X (10.6.7) machine with 2.93 GHz Intel Core 2 
Duo and 4 GB 1067 MHz DDR3 memory. 

Simulated CNV in normal samples 

We consider one of the simulated datasets described 
in [25] with relatively short deletions and duplications 
(300 comprising 5, 10, 20, 30, 40, 50 SNPs each) are 
inserted in the middle of 13000 SNPs long sequences, 
using a combination of male and female X chromosome 
data from the Illumina HumanHap550 array, appropri- 
ately pre-processed to avoid biases. These steps included a 
scrambling of SNP positions so as to avoid long-range sig- 
nal fluctuation. This setting mimics the small rare CNVs 
possibly occurring in the genome of normal individuals. 



In our main analysis, therefore, we process one individ- 
ual at the time, reflecting the typical level of information 
available to scientists in these contexts. HMM methods, 
like PennCNV, are expected to be the most effective in 
this problem; segmentation methods like CBS are closer 
to our own and therefore also make an interesting com- 
parison. As repeatedly discussed, the Illumina platform 
produces two signals for one subject, LRR and BAF. A 
segmentation method that can process one signal at the 
time would give its best results using LRR, which car- 
ries most of the information. Given this background, we 
compare four methods: PennCNV, CBS on LRR, fused 
lasso on LRR only, and group fused lasso on LRR and 
mBAF. The implementations we use are those reflected 
in the software packages: PennCNV (version 2010May01), 
R package DNAcopy for CBS (version 1.24.0) [52] and 
our own R package Piet (version 0.1.0). Tuning parame- 
ters for PennCNV and CBS are set at the default values; 
the fused lasso implementation corresponds to \\ = 0.1, 
A.2 = 2 x V13000, and A3 = 0 and the group fused lasso 
to A.i = 0.1, X 2 = 0, and A 3 = 2 x V13000. To call 
deletions and duplications with CBS and the two fused- 
lasso approaches, we use both LRR and BAF data (prior to 
transformation to mBAF) with the following cut-off val- 
ues: r\ = 10 and r2 = 1(1.5) for duplication (deletion). 
Performance is evaluated by the same indexes we used in 
[25], the true positive rate (TPR or sensitivity) and the 
false discovery rate (FDR), all defined on a per SNP basis. 
Results are summarized in Table 1. 

Not surprisingly, all algorithms perform similarly well 
for larger deletions/duplications, and it is mainly for vari- 
ants that involve 10 or fewer SNPs that differences are vis- 
ible. Algorithms that rely only on LRR (for example, CBS 
and fused lasso) underperform in the detection of small 
duplications. Comparison is particularly easy for dupli- 
cations involving 10 SNPs, where the selected parame- 
ter values lead to similar FDRs in the three segmenta- 
tion methods. The group fused lasso can almost entirely 
recover the performance of PennCNV and outperforms 
CBS in this context. 

Out of curiosity, we analyzed all sequences simultane- 
ously. While this represents an unrealistic amount of prior 
information, it allows us to evaluate the possible advan- 
tages of joint analysis. FDR practically became 0 (<0.02%) 
for all CNV sizes, but power increases only for CNVs 
including fewer than 10 SNPs. 

Finally, it is useful to compare running times. Summary 
statistics of the per sample time are reported in Table 1. 
While all algorithms are rather fast, the two implementa- 
tions of the fused lasso dominate. 

A simulated tumor data set 

To explore the challenges presented by tumor data, we 
rely on a data set created by [17], with the specific goal of 
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Table 1 Detection accuracy and computation time of four methods on simulated normal samples 



CNVsize 


CNVtype 


PennCNV 


CBS 




Fused Lasso 


Group Fused Lasso 






TPR 


FDR 


TPR 


FDR 


TPR 


FDR 


TPR 


FDR 


5 


Deletion 


83.80 


4.92 


78.20 


0.68 


63.93 


1.74 


64.27 


1.83 




Duplication 


58.53 


4.67 


11.67 


10.26 


20.00 


37.76 


39.87 


14.33 


10 


Deletion 


95.03 


1.45 


88.37 


0.56 


88.50 


0.60 


88.87 


0.56 




Duplication 


93.43 


0.78 


56.50 


4.40 


83.90 


12.60 


91.60 


3.85 


20 


Deletion 


94.63 


0.58 


90.50 


0.39 


90.80 


0.47 


90.83 


0.47 




Duplication 


96.13 


0.92 


86.22 


3.58 


92.77 


4.95 


94.98 


2.13 


30 


Deletion 


94.57 


0.28 


93.30 


0.29 


89.38 


0.52 


89.77 


0.53 




Duplication 


96.09 


0.05 


90.77 


1.61 


94.32 


1.78 


94.98 


1.29 


40 


Deletion 


97.83 


0.59 


97.58 


0.09 


97.28 


0.19 


97.28 


0.19 




Duplication 


94.61 


0.46 


92.77 


0.98 


93.94 


1.15 


94.63 


0.75 


50 


Deletion 


94.33 


0.07 


92.76 


0.04 


90.47 


0.11 


90.48 


0.11 




Duplication 


94.50 


0.09 


93.81 


0.74 


93.11 


0.79 


93.64 


0.49 


Overall Deletion 


95.02 


0.55 


93.06 


0.19 


91.08 


0.33 


91.19 


0.34 


Overall Duplication 


93.82 


0.44 


86.92 


1.55 


90.56 


2.85 


92.46 


1.38 


Overall 




94.42 


0.49 


89.99 


0.85 


90.82 


1.60 


91.83 


0.87 



Time (sec.) 0.48 (0.01 ) 0.78 (0.69) 0.22 (0.1 3) 0.28 (0.05) 

TPR and FDR are measured as the percentage of related SNPs. Overall accuracy is calculated by pooling all sequences with a given type of CNV. Also reported are the 
average and standard deviation of the number of seconds required for the analysis of one sequence. 



studying the effect of contamination between normal and 
cancer cells. The HapMap sample NA06991, genotyped 
on the Illumina HumanHap550 array, was used to simu- 
late a cancer cell line by inserting a total of 10 structure 
variation regions, including one-copy losses, one-copy 
gains, and copy neutral loss-of-hetrozygosity (CN-LOH) 
(see Additional file 2: Table SI). The signal from this artifi- 
cial "tumor" sample was then contaminated in silico with 
that of the original "normal" sample, resulting in 21 data 
sets, with a percentage of normal cells ranging from 0% 
to 100%. Note that most simulated CNV or CN-LOH 
regions are very large (some spanning an entire chromo- 
some), and the challenge in detection is really due to the 
contamination levels. 

For ease of comparison, we evaluate the accuracy of call- 
ing procedures as in the original reference [17]. Sensitivity 
is measured for each variant region as the percentage 
of heterozygous SNPs that are assigned the correct copy 
number, and specificity is measured as the percentage 
of originally heterozygous SNPs in unperturbed regions 
that are assigned CN=2. We compare the performance 
of GFL to BAFsegmentation [17] and PSCN [53] repre- 
senting, respectively, a version of segmentation and HMM 
approaches specifically developed to deal with contami- 
nated tumor samples. Both of these algorithms have been 
tested with success on this simulated data set. 

Following other analyses, we do not pre-process the 
data prior to CNV detection. BAFsegmentation and PSCN 



were run using recommended parameter values. For each 
of the diluted data sets, we applied the GFL model on each 
chromosome, simultaneously using both LRR and mBAF, 
whose standard deviations are normalized to 1. Tuning 
constants are set to k\ = 0, A. 2 = 0.5 x 3 x ^JlogN, and 
A3 = 0.5 x 3 x yiogAf for chromosomes interrogated by 
N SNPs. The change points resulting from hard segmen- 
tation on LRR and mBAF are combined to make a finer 
segmentation of the genome. Finally, we adopt the same 
calling procedure described by [17]. For ease of compari- 
son with PSCN, only analyses of simulated tumor data are 
reported, even if BAFsegmentation and GFL would gain 
from using the genotype of the normal cell in defining 
mBAF. 

Figure 1 summarizes the sensitivity of each method, as 
a function of the percentage of normal cells in the sam- 
ple. Sensitivity is calculated for each of the 10 regions 
separately. All three methods work reasonably well under 
a wide range of percentages of normal cell contamina- 
tion. In 5 out of the 10 regions, GFL appears to lead 
to best results, while in the other 5, PSCN does. The 
CNV region involving the fewest SNPs is the hemizy- 
gous loss on Chromosome 13. In this case GFL in our 
hands behaved in the most stable manner. GFL outper- 
forms the two comparison methods in terms of specificity 
(Figure 2). While the specificity values might appear very 
high in any case, this is somewhat an artifact of how 
we define this index. In the interest of fairness, it is 



Zhang etal. BMC Bioinformatics 2012, 13:205 
http://www.biomedcentral.eom/1 471 -21 05/1 3/205 



Page 9 of 15 



■■io.6 

£0.4 
CD n 9 
« > « 
0 



CN-LOH chr 5p (2756) 



30 40 50 60 70 

Normal cell contamination (%) 



-f=f- 



^0.6 
£0.4 

cd n o 



1-copy loss chr 5q22 (79) 



T T T T T 



20 30 40 50 60 70 

Normal cell contamination (%) 



~0.6 
£0.4 
CDn 9 



1-copy gain chr 8p (3830) 



20 30 40 50 60 70 

Normal cell contamination (%) 



~0.6 
£0.4 
0.2 
0 



C/5 



1-copy gain chr 8q24 (91) 



T T T T T T T 



30 40 50 60 70 

Normal cell contamination (%) 



■■io.6 

£0.4 

CD n o 

o 



1-copy loss 9p (3889) 



30 40 50 60 70 

Normal cell contamination (%) 



T— T— T 



■-10.6 
£0.4 

CD n 9 

0 



1-copy loss chr 10q23.1-q23.33 (648) 



T T T T T 



30 40 50 60 70 

Normal cell contamination (%) 



foi 

~0.6 
£0.4 
CDn 2 

05 0 



Trisomy chr 12 (8818) 



T— T— T— T— T 



20 30 40 50 60 70 

Normal cell contamination (%) 



~0.6 
£0.4 
CD n 2 
W 0 



1-copy loss chr 13q13.1 (10) 



-T=t- 



-T=f- 



T T T T T 



30 40 50 60 70 

Normal cell contamination (%) 



■■io.6 

£0.4 

CD n 9 



CN- LOH chr 17p13.1-p12 (308) 



30 40 50 60 70 

Normal cell contamination (%) 



-T=f- 



■-io.6 

£0.4 

CD n 9 



CN-LOH 17q (3205) 



20 30 40 50 60 70 

Normal cell contamination (%) 



PSCN 



BAFsegmentation - 



GFL 



Figure 1 Sensitivity as a function of percentage contamination by normal cells in the 1 0 different simulated CNV regions. Sensitivity is not 
defined at 100% contamination. 



relevant to note that the performance of PSCN in our 
hands did conform to the published performance [53]. 
While we tried our best to set the parameter values, we 
have not succeeded in replicating the authors' original 
results. 

PSCN, like GFL, is implemented in R with some com- 
putationally intensive subroutines coded in C BAFseg- 
mentation relies on the R package DNAcopy, whose core 
algorithms are implemented in C and Fortran. BAFseg- 
mentation wraps these in Perl. A comparison of run times 
indicate that GLF and BAFsegmentation are comparable, 
while PSCN is fifty times slower than GFL (see Additional 
file 3: Table S2). 

One sample assayed with multiple replicates and multiple 
platforms 

We use the data from a study [54] assessing the per- 
formance of different array platforms and CNV calling 



methods to illustrate the advantages of joint analysis of 
multiple measurements on the same subject. DNA from 
four individuals was analyzed in triplicate on each of 5 
platforms: Affymetrix 6.0, Illumina 1M, 660W, Omnil- 
Quad (OIQ) and Omni2.5-Quad (02Q) (among others 
[54]). We use the results on the first three to define "true" 
copy numbers and try to reconstruct them using data from 
OIQ and 02Q. The nine "reference" experiments were 
analyzed with 4 or 5 CNV calling algorithms [54] and a 
CNV was identified using majority votes. Consistent evi- 
dence was required from at least 2 analysis tools, on at 
least 2 platforms, and in at least 2 replicates (see Addi- 
tional file 4: Table S3). Here overlapping CNVs detected 
in two replicates/algorithms/platforms are collapsed to a 
single CNV. 

The test experiments are based on 1,020,596 and 
2,390,395 autosomal SNPs, which after quality control 
reduce to a total of 2,657,077 unique loci. Since our focus 
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Figure 2 Specificity as a function of percentage contamination by normal cells. Note that [53] reports better performance of PSCN in dealing 
with contamination levels of 85%, 95% and 100%. 



here is to investigate how to best analyze multiple signals 
on the same subject, rather than on the specific properties 
of any CNV calling method, we carry out all the analy- 
ses using different settings of GFL in segmentation while 
keeping the same CNV calling and summarizing proce- 
dures. All segmentation is done on LRR only, while calling 
procedure uses both LRR and BAF (with cut-off r\ = 10 
and r2 = 1). Here we compare three segmentation settings 
to analyze these 6 experiments per subject (see Additional 
file 5: Table S4 for more details about tuning parameters): 

1. The signals from the three technical replicates with 
one platform are averaged and then segmented and 
subjected to calling procedure separately. The final 
CNV list is the union of CNV calls from the two 
platforms. 

2. The signals from the three technical replicates with 
one platform are each segmented and separately 
subjected to calling. A majority vote of at least two 
out of three is used to summarize each CNV result 
for each platform. The final CNV list is the union of 
the two platforms' lists. 

3. The signals from the three technical replicates of both 
platforms (6 LRR sequences) are segmented jointly. 



Calling is still done on each replicate separately, and 
the same majority vote is used to summarize the 
CNV result for each platform. Again, the final CNV 
list is the union of the two platforms' results. 

To benchmark the result of joint analysis, we use 
MPCBS [20], a segmentation method, specifically 
designed for multi-platform CNV analysis. The segments 
output from MPCBS are subjected to the same calling, 
majority voting, and summarizing procedures. 

Table 2 presents our results. Averaging from different 
technical replicates leads to loss of power, while joint 
analysis of all the signals leads to the most effective per- 
formance. GFL joint analysis leads to results comparable 
to those of MPCBS, but it is at least 30 times faster than 
MPCBS. 

Multiple related samples assayed with the same platform 

In the context of a study of the genetic basis of bipo- 
lar disorder, the Illumina Omni2.5-Quad chip was used 
to genotype 455 individuals from 11 Columbian and 13 
Costa Rican pedigrees. We use this data set to explore the 
advantages of a joint segmentation of related individuals. 
In the absence of a reference evaluation of CNV status in 



Table 2 Comparison of four CNV analyses on four real normal samples 



Analysis 


NA15510 


NA18517 


NA18576 


NA18980 


Time (min.) 


#Det. 


#Ovlp. 


#Det. 


#Ovlp. 


#Det. 


#Ovlp 


#Det. 


#Ovlp 


Analysis 1 


170 


38 


144 


34 


160 


25 


145 


22 


1.2 


Analysis 2 


102 


36 


109 


33 


93 


25 


91 


20 


3.7 


Analysis 3 


80 


38 


82 


32 


69 


25 


56 


15 


8.5 


MPCBS 


98 


34 


88 


28 


59 


18 


68 


21 


313.9 



The number of CNV detected (Det.) and overlapping (Ovlp.) and the average computation time (in minutes) for each sample under the different analyses. 
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these samples, we rely on two indirect methods to assess 
the quality of the predicted CNVs. We used the collec- 
tion of CNVs observed in HapMap Phase III [5] to compile 
a list of 426 copy number polymorphisms (selecting all 
those CNVs with frequency > 0.05 in pooled samples 
from 11 populations) and assumed that if we identify in 
our sample a CNV corresponding to one of these regions, 
we should consider it a true positive. For the purposes 
of this analysis we considered a detected CNV to cor- 
respond to one identified in HapMap if there was any 
overlap between the two regions. 

Another indirect measure of the quality of CNV calls 
derives from the number of Mendelian errors encoun- 
tered in the pedigrees when we consider the CNV as a 
segregating site. De novo CNVs are certainly a possibil- 
ity, and in their case Mendelian errors are to be expected. 
However, when the CNV in question is a common one 
(already identified in HapMap), it is reasonable to expect 
that it segregates in the pedigrees as any regular polymor- 
phism. We selected a very common deletion on Chromo- 
some 8 (HapMap reports overall frequency > 0.4 in 11 
populations) and compared different CNV calling proce- 
dures on the basis of how many Mendelian errors they 
generate. 

As mentioned before, PennCNV represents a state-of- 
the-art HMM method for the analysis of normal sam- 
ples and, therefore, we included it in our comparisons. 
However, the parameters of the underlying HMM algo- 
rithm had not been tuned on the Omni2.5-Quad at this 
time, resulting in sub-standard performance. Segmenta- 
tion methods are less dependent on parameter optimiza- 
tion; hence, GFL analysis of LRR and BAF one subject 
at a time can provide a better indication of the potential 
of single-sample methods. We considered two multiple- 
sample algorithms: GFL and MSSCAN [16], both applied 
on LRR with the group structure denned by pedigree 
membership. (While a trio-mode is available in PennCNV 
[55], this does not adapt to the structure of our fami- 
lies.) A final qualification is in order. While the authors 
of MSSCAN kindly shared with us a beta-version of their 
software, we did not find it to be robust. Indeed, we were 
unable to use it to segment the entire genome. However, 
we successfully used it to segment Chromosome 8, so that 
we could include MSSCAN in the comparison based on 
Mendelian errors. 



Prior to analysis, the data was normalized using the GC- 
content correction implemented in PennCNV [29]. For 
individual analysis, the GFL parameters were k\ = 0.1, 
A, 2 = 0, and A3 = 2 x yiogAf, where N is the number of 
SNPs deployed on each chromosome; for pedigree analy- 
sis, the GFL parameters were k\ = 0.1, A 2 = 0.5 x 2 x 
7 log TV, and X 3 = 0.5 x 2 x V0.3M x ylogAf, where M is 
the number of individuals in each pedigree. For MSSCAN, 
CNV size is constrained to be fewer than 200 SNPs, and 
the maximum number of change points is set at 50. The 
calling procedure with r\ = 10 and r<i = 1 was applied to 
both the GFL and MSSCAN results. 

Table 3 summarizes the total number of copy number 
polymorphisms (CNPs) identified in our sample by dif- 
ferent approaches and their overlap with known CNPs 
from HapMap. For the purpose of this comparison we 
considered a variant to be a CNP when its frequency in 
our sample was at least 10%. All analysis modes of GFL 
show a higher percent of overlap with the HapMap list 
than the PennCNV list. It is also clear that GFL pedigree 
analysis achieves a larger overlap with the HapMap data 
than the GFL individual analysis. The time cost per sample 
is reasonable and scales well with the increment of sample 
size. 

Table 4 summarizes the results of our investigation of a 
154kb CNP region on Chromosome 8p (from 39,351,896 
to 39,506,122 on NCBI Build 36 coordinate). All methods 
but PennCNV show detected deletions only; this coin- 
cides with the observation from HapMap data. We used 
option Mistyping of Mendel (version 11.0) [56,57] to 
detect Mendelian errors. Joint segmentation methods dis- 
cover more hemizygous deletions than individual analysis, 
resulting in fewer Mendelian errors. MSSCAN discovers 
the largest number of hemizygous deletions. Figure 3 
shows four families derived from a large pedigree where 
3 out of 4 Mendelian errors are removed by joint 
analysis. 

Conclusions 

We have presented a segmentation method based on 
penalized estimation that is capable of processing mul- 
tiple signals jointly. We have shown how this leads to 
improvements in the analysis of normal samples (where 
segmentation can be applied to both total intensity and 
allelic proportions), tumor samples (where we are able to 



Table 3 Comparison of three CNV analyses in the bipolar disorder study 

Method #Detected CNVR #Overlap %Overlap Time(min.) 

PennCNV 189 63 33.33% 3.44 

GFL-lndividual (LRR+BAF) 95 50 52.63% 3.90 

GFL-Pedigree (LRR) 106 62 58.49% 1.57 

The number and overlap of CNP regions with frequency > 0.1 detected in our sample by different methods. These CNP regions were compiled from HapMap. 
Computation time is given in minutes per sample. 
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Table 4 Detected CNVs in a common deletion on Chromosome 8 



Method 


#CN=0 


#CN=1 


#CN=3 


#Families with 
Mendelian errors 


Time (min.) 


PennCNV 


125 


39 


102 


35 


0.19 


GFL-lndividual 


123 


97 


0 


20 


0.21 


GFL-Pedigree 


123 


137 


0 


15 


0.09 


MSSCAN-Pedigree 


123 


154 


0 


15 


0.11 



Across the various algorithms, subjects are assigned to one of 4 copy numbers. For each algorithm, we report the total numbers of CN/ 2 identified, the total number 
of nuclear families with Mendelian errors, and the average computation time (in minutes) per sample for the analysis of Chromosome 8. 



deal with contamination effectively), measurements from 
multiple platforms, and related individuals. Given that 
copy number detection is such an active area of research, 
it is impossible to compare one method to all other avail- 
able methods. However, for each of the situations we 
analyzed, we tried to select state of the art alternative 
approaches. In comparison to these, the algorithm we 



present performs well. Its accuracy is always comparable 
to that of the most effective competitor and its computa- 
tion times are better contained. Given its versatility and 
speed, GFL is, in our opinion, particularly useful for initial 
screening. 

There are of course many aspects of CNV detection, 
ranging from normalization and signal transformation to 
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Figure 3 CNV detection and Mendelian errors for a Central American pedigree. Displayed are four families derived from an extended pedigree. 
Circles and squares correspond to females and males. The dashed line is used to indicate identical individuals. Beneath each individual, from top to 
bottom, are CNV genotypes by PennCNV and by GFL. The subjects for whom PennCNV and GLF infer different CNV genotypes are highlighted in red 
and blue. Red is used when PennCNV genotypes result in Mendelian error, while GFL genotypes do not. Blue is used when both genotypes are 
compatible with Mendelian transmissions. Orange singles out a member for whom both PennCNV and GFL genotypes result in Mendelian error. 
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FDR control of detected CNV, that we have not analyzed 
in this paper. There are also a number of improvements 
to our approach that appear promising, but at this 
stage are left for further work. For example, it is easy to 
modify algorithms so that the penalization parameters 
are location dependent and incorporate prior informa- 
tion on known copy number polymorphisms. It will 
be more challenging to develop theory and methods to 
select the values of these regularization parameters in a 
data-adaptive fashion. 

Finally, while our scientific motivation has been the 
study of copy number variations, the joint segmentation 
algorithm we present is not restricted to specific 
characteristics of these data types, and we expect it will be 
applied in other contexts. 

Implementation and availablity 

We have implemented the segmentation routine, which is 
our core contribution, in an R package (Piet) available at 
R- forge [58]. To demonstrate a visualization of the CNV 
results on Chromosome 8 in the bipolar disorder study, 
we refer readers to Additional file 6: Figure SI. 



Additional files 



Additional file 1 : Supplementary Text. Specification of surrogate 
function, justification of choice of tuning parameters, details of calling 
procedure. 

Additional file 2: Table SI. Regions of allelic imbalance imputed to the 
HapMap sample NA06991. 

Additional file 3: Table S2. Speed comparison of three methods: GFL, 
BAFsegmentation and PSCN. 

Additional file 4: Table S3. Sample information and reference CNV 
regions summarized for each sample by their types and sizes. 

Additional file 5: Table S4. Summary of results for four real samples 
under different CNV analyses. 

Additional file 6: Figure SI . Visualization of pedigree-wise CNV analysis 
results of Chromosome 8 data in the bipolar disorder study. In the main 
body of the plot, CNVs estimated for each individual are marked by small 
segments with color code: CN=0 in blue, CN=1 in light blue, CN=3 in red 
and CN=4 in brown. Each subject is a row, each SNP a column. Subjects 
belonging to the same pedigree are stacked together. The pedigree names 
are indicated on the left-hand side with the number of pedigree members 
included in parentheses. On the right-hand side, the barplot represents the 
number of CNVs detected per subject. Two shades of green are switched 
alternately to indicate the pedigree to which the subject belongs. At the 
bottom, the gray histogram shows the GC content along the chromosome. 
Coordinated with the representation of CNVs in the main body, the green 
histogram counts the frequency of CNVs among the subjects represented. 
Vertical dotted line marks the centromere. 
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